{
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.9-final"
  },
  "orig_nbformat": 2,
  "kernelspec": {
   "name": "pycharm-f7427e7c",
   "display_name": "PyCharm (pythonProject)",
   "language": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2,
 "cells": [
  {
   "source": [
    "<center><h1>第七章 缺失数据</h1></center>"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "source": [
    "## 一、缺失值的统计和删除\n",
    "### 1. 缺失信息的统计\n",
    "\n",
    "缺失数据可以使用`isna`或`isnull`（两个函数没有区别）来查看每个单元格是否缺失，结合`mean`可以计算出每列缺失值的比例："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "   Grade   Name  Gender  Height  Weight  Transfer\n",
       "0  False  False   False   False   False     False\n",
       "1  False  False   False   False   False     False\n",
       "2  False  False   False   False   False     False\n",
       "3  False  False   False    True   False     False\n",
       "4  False  False   False   False   False     False"
      ],
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>Grade</th>\n      <th>Name</th>\n      <th>Gender</th>\n      <th>Height</th>\n      <th>Weight</th>\n      <th>Transfer</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>False</td>\n      <td>False</td>\n      <td>False</td>\n      <td>False</td>\n      <td>False</td>\n      <td>False</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>False</td>\n      <td>False</td>\n      <td>False</td>\n      <td>False</td>\n      <td>False</td>\n      <td>False</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>False</td>\n      <td>False</td>\n      <td>False</td>\n      <td>False</td>\n      <td>False</td>\n      <td>False</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>False</td>\n      <td>False</td>\n      <td>False</td>\n      <td>True</td>\n      <td>False</td>\n      <td>False</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>False</td>\n      <td>False</td>\n      <td>False</td>\n      <td>False</td>\n      <td>False</td>\n      <td>False</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 2
    }
   ],
   "source": [
    "df = pd.read_csv('../data/learn_pandas.csv', usecols = ['Grade', 'Name', 'Gender', 'Height', 'Weight', 'Transfer'])\n",
    "df.isna().head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "Grade       0.000\n",
       "Name        0.000\n",
       "Gender      0.000\n",
       "Height      0.085\n",
       "Weight      0.055\n",
       "Transfer    0.060\n",
       "dtype: float64"
      ]
     },
     "metadata": {},
     "execution_count": 3
    }
   ],
   "source": [
    "df.isna().mean() # 查看缺失的比例"
   ]
  },
  {
   "source": [
    "如果想要查看某一列缺失或者非缺失的行，可以利用`Series`上的`isna`或者`notna`进行布尔索引。例如，查看身高缺失的行："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "        Grade          Name  Gender  Height  Weight Transfer\n",
       "3   Sophomore  Xiaojuan Sun  Female     NaN    41.0        N\n",
       "12     Senior      Peng You  Female     NaN    48.0      NaN\n",
       "26     Junior     Yanli You  Female     NaN    48.0        N\n",
       "36   Freshman  Xiaojuan Qin    Male     NaN    79.0        Y\n",
       "60   Freshman    Yanpeng Lv    Male     NaN    65.0        N"
      ],
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>Grade</th>\n      <th>Name</th>\n      <th>Gender</th>\n      <th>Height</th>\n      <th>Weight</th>\n      <th>Transfer</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>3</th>\n      <td>Sophomore</td>\n      <td>Xiaojuan Sun</td>\n      <td>Female</td>\n      <td>NaN</td>\n      <td>41.0</td>\n      <td>N</td>\n    </tr>\n    <tr>\n      <th>12</th>\n      <td>Senior</td>\n      <td>Peng You</td>\n      <td>Female</td>\n      <td>NaN</td>\n      <td>48.0</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>26</th>\n      <td>Junior</td>\n      <td>Yanli You</td>\n      <td>Female</td>\n      <td>NaN</td>\n      <td>48.0</td>\n      <td>N</td>\n    </tr>\n    <tr>\n      <th>36</th>\n      <td>Freshman</td>\n      <td>Xiaojuan Qin</td>\n      <td>Male</td>\n      <td>NaN</td>\n      <td>79.0</td>\n      <td>Y</td>\n    </tr>\n    <tr>\n      <th>60</th>\n      <td>Freshman</td>\n      <td>Yanpeng Lv</td>\n      <td>Male</td>\n      <td>NaN</td>\n      <td>65.0</td>\n      <td>N</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 4
    }
   ],
   "source": [
    "df[df.Height.isna()].head()"
   ]
  },
  {
   "source": [
    "如果想要同时对几个列，检索出全部为缺失或者至少有一个缺失或者没有缺失的行，可以使用`isna, notna`和`any, all`的组合。例如，对身高、体重和转系情况这3列分别进行这三种情况的检索："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "      Grade          Name Gender  Height  Weight Transfer\n",
       "102  Junior  Chengli Zhao   Male     NaN     NaN      NaN"
      ],
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>Grade</th>\n      <th>Name</th>\n      <th>Gender</th>\n      <th>Height</th>\n      <th>Weight</th>\n      <th>Transfer</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>102</th>\n      <td>Junior</td>\n      <td>Chengli Zhao</td>\n      <td>Male</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 5
    }
   ],
   "source": [
    "sub_set = df[['Height', 'Weight', 'Transfer']]\n",
    "df[sub_set.isna().all(1)] # 全部缺失"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "        Grade           Name  Gender  Height  Weight Transfer\n",
       "3   Sophomore   Xiaojuan Sun  Female     NaN    41.0        N\n",
       "9      Junior        Juan Xu  Female   164.8     NaN        N\n",
       "12     Senior       Peng You  Female     NaN    48.0      NaN\n",
       "21     Senior  Xiaopeng Shen    Male   166.0    62.0      NaN\n",
       "26     Junior      Yanli You  Female     NaN    48.0        N"
      ],
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>Grade</th>\n      <th>Name</th>\n      <th>Gender</th>\n      <th>Height</th>\n      <th>Weight</th>\n      <th>Transfer</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>3</th>\n      <td>Sophomore</td>\n      <td>Xiaojuan Sun</td>\n      <td>Female</td>\n      <td>NaN</td>\n      <td>41.0</td>\n      <td>N</td>\n    </tr>\n    <tr>\n      <th>9</th>\n      <td>Junior</td>\n      <td>Juan Xu</td>\n      <td>Female</td>\n      <td>164.8</td>\n      <td>NaN</td>\n      <td>N</td>\n    </tr>\n    <tr>\n      <th>12</th>\n      <td>Senior</td>\n      <td>Peng You</td>\n      <td>Female</td>\n      <td>NaN</td>\n      <td>48.0</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>21</th>\n      <td>Senior</td>\n      <td>Xiaopeng Shen</td>\n      <td>Male</td>\n      <td>166.0</td>\n      <td>62.0</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>26</th>\n      <td>Junior</td>\n      <td>Yanli You</td>\n      <td>Female</td>\n      <td>NaN</td>\n      <td>48.0</td>\n      <td>N</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 6
    }
   ],
   "source": [
    "df[sub_set.isna().any(1)].head() # 至少有一个缺失"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "       Grade            Name  Gender  Height  Weight Transfer\n",
       "0   Freshman    Gaopeng Yang  Female   158.9    46.0        N\n",
       "1   Freshman  Changqiang You    Male   166.5    70.0        N\n",
       "2     Senior         Mei Sun    Male   188.9    89.0        N\n",
       "4  Sophomore     Gaojuan You    Male   174.0    74.0        N\n",
       "5   Freshman     Xiaoli Qian  Female   158.0    51.0        N"
      ],
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>Grade</th>\n      <th>Name</th>\n      <th>Gender</th>\n      <th>Height</th>\n      <th>Weight</th>\n      <th>Transfer</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>Freshman</td>\n      <td>Gaopeng Yang</td>\n      <td>Female</td>\n      <td>158.9</td>\n      <td>46.0</td>\n      <td>N</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>Freshman</td>\n      <td>Changqiang You</td>\n      <td>Male</td>\n      <td>166.5</td>\n      <td>70.0</td>\n      <td>N</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>Senior</td>\n      <td>Mei Sun</td>\n      <td>Male</td>\n      <td>188.9</td>\n      <td>89.0</td>\n      <td>N</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>Sophomore</td>\n      <td>Gaojuan You</td>\n      <td>Male</td>\n      <td>174.0</td>\n      <td>74.0</td>\n      <td>N</td>\n    </tr>\n    <tr>\n      <th>5</th>\n      <td>Freshman</td>\n      <td>Xiaoli Qian</td>\n      <td>Female</td>\n      <td>158.0</td>\n      <td>51.0</td>\n      <td>N</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 7
    }
   ],
   "source": [
    "df[sub_set.notna().all(1)].head() # 没有缺失"
   ]
  },
  {
   "source": [
    "### 2. 缺失信息的删除\n",
    "\n",
    "数据处理中经常需要根据缺失值的大小、比例或其他特征来进行行样本或列特征的删除，`pandas`中提供了`dropna`函数来进行操作。\n",
    "\n",
    "`dropna`的主要参数为轴方向`axis`（默认为0，即删除行）、删除方式`how`、删除的非缺失值个数阈值`thresh`（$\\color{red}{非缺失值}$没有达到这个数量的相应维度会被删除）、备选的删除子集`subset`，其中`how`主要有`any`和`all`两种参数可以选择。\n",
    "\n",
    "例如，删除身高体重至少有一个缺失的行："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "(174, 6)"
      ]
     },
     "metadata": {},
     "execution_count": 8
    }
   ],
   "source": [
    "res = df.dropna(how = 'any', subset = ['Height', 'Weight'])\n",
    "res.shape"
   ]
  },
  {
   "source": [
    "例如，删除超过15个缺失值的列："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "       Grade            Name  Gender  Weight Transfer\n",
       "0   Freshman    Gaopeng Yang  Female    46.0        N\n",
       "1   Freshman  Changqiang You    Male    70.0        N\n",
       "2     Senior         Mei Sun    Male    89.0        N\n",
       "3  Sophomore    Xiaojuan Sun  Female    41.0        N\n",
       "4  Sophomore     Gaojuan You    Male    74.0        N"
      ],
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>Grade</th>\n      <th>Name</th>\n      <th>Gender</th>\n      <th>Weight</th>\n      <th>Transfer</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>Freshman</td>\n      <td>Gaopeng Yang</td>\n      <td>Female</td>\n      <td>46.0</td>\n      <td>N</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>Freshman</td>\n      <td>Changqiang You</td>\n      <td>Male</td>\n      <td>70.0</td>\n      <td>N</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>Senior</td>\n      <td>Mei Sun</td>\n      <td>Male</td>\n      <td>89.0</td>\n      <td>N</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>Sophomore</td>\n      <td>Xiaojuan Sun</td>\n      <td>Female</td>\n      <td>41.0</td>\n      <td>N</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>Sophomore</td>\n      <td>Gaojuan You</td>\n      <td>Male</td>\n      <td>74.0</td>\n      <td>N</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 9
    }
   ],
   "source": [
    "res = df.dropna(1, thresh=df.shape[0]-15) # 身高被删除\n",
    "res.head()"
   ]
  },
  {
   "source": [
    "当然，不用`dropna`同样是可行的，例如上述的两个操作，也可以使用布尔索引来完成："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "(174, 6)"
      ]
     },
     "metadata": {},
     "execution_count": 10
    }
   ],
   "source": [
    "res = df.loc[df[['Height', 'Weight']].notna().all(1)]\n",
    "res.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "       Grade            Name  Gender  Weight Transfer\n",
       "0   Freshman    Gaopeng Yang  Female    46.0        N\n",
       "1   Freshman  Changqiang You    Male    70.0        N\n",
       "2     Senior         Mei Sun    Male    89.0        N\n",
       "3  Sophomore    Xiaojuan Sun  Female    41.0        N\n",
       "4  Sophomore     Gaojuan You    Male    74.0        N"
      ],
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>Grade</th>\n      <th>Name</th>\n      <th>Gender</th>\n      <th>Weight</th>\n      <th>Transfer</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>Freshman</td>\n      <td>Gaopeng Yang</td>\n      <td>Female</td>\n      <td>46.0</td>\n      <td>N</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>Freshman</td>\n      <td>Changqiang You</td>\n      <td>Male</td>\n      <td>70.0</td>\n      <td>N</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>Senior</td>\n      <td>Mei Sun</td>\n      <td>Male</td>\n      <td>89.0</td>\n      <td>N</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>Sophomore</td>\n      <td>Xiaojuan Sun</td>\n      <td>Female</td>\n      <td>41.0</td>\n      <td>N</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>Sophomore</td>\n      <td>Gaojuan You</td>\n      <td>Male</td>\n      <td>74.0</td>\n      <td>N</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 11
    }
   ],
   "source": [
    "res = df.loc[:, ~(df.isna().sum()>15)]\n",
    "res.head()"
   ]
  },
  {
   "source": [
    "## 二、缺失值的填充和插值\n",
    "### 1. 利用fillna进行填充\n",
    "\n",
    "在`fillna`中有三个参数是常用的：`value, method, limit`。其中，`value`为填充值，可以是标量，也可以是索引到元素的字典映射；`method`为填充方法，有用前面的元素填充`ffill`和用后面的元素填充`bfill`两种类型，`limit`参数表示连续缺失值的最大填充次数。\n",
    "\n",
    "下面构造一个简单的`Series`来说明用法："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "a    NaN\n",
       "a    1.0\n",
       "a    NaN\n",
       "b    NaN\n",
       "c    2.0\n",
       "d    NaN\n",
       "dtype: float64"
      ]
     },
     "metadata": {},
     "execution_count": 12
    }
   ],
   "source": [
    "s = pd.Series([np.nan, 1, np.nan, np.nan, 2, np.nan], list('aaabcd'))\n",
    "s"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "a    NaN\n",
       "a    1.0\n",
       "a    1.0\n",
       "b    1.0\n",
       "c    2.0\n",
       "d    2.0\n",
       "dtype: float64"
      ]
     },
     "metadata": {},
     "execution_count": 13
    }
   ],
   "source": [
    "s.fillna(method='ffill') # 用前面的值向后填充"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "a    NaN\n",
       "a    1.0\n",
       "a    1.0\n",
       "b    NaN\n",
       "c    2.0\n",
       "d    2.0\n",
       "dtype: float64"
      ]
     },
     "metadata": {},
     "execution_count": 14
    }
   ],
   "source": [
    "s.fillna(method='ffill', limit=1) # 连续出现的缺失，最多填充一次"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "a    1.5\n",
       "a    1.0\n",
       "a    1.5\n",
       "b    1.5\n",
       "c    2.0\n",
       "d    1.5\n",
       "dtype: float64"
      ]
     },
     "metadata": {},
     "execution_count": 15
    }
   ],
   "source": [
    "s.fillna(s.mean()) # value为标量"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "a    100.0\n",
       "a      1.0\n",
       "a    100.0\n",
       "b      NaN\n",
       "c      2.0\n",
       "d    200.0\n",
       "dtype: float64"
      ]
     },
     "metadata": {},
     "execution_count": 16
    }
   ],
   "source": [
    "s.fillna({'a': 100, 'd': 200}) # 通过索引映射填充的值"
   ]
  },
  {
   "source": [
    "有时为了更加合理地填充，需要先进行分组后再操作。例如，根据年级进行身高的均值填充："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    158.900000\n",
       "1    166.500000\n",
       "2    188.900000\n",
       "3    163.075862\n",
       "4    174.000000\n",
       "Name: Height, dtype: float64"
      ]
     },
     "metadata": {},
     "execution_count": 17
    }
   ],
   "source": [
    "df.groupby('Grade')['Height'].transform(lambda x: x.fillna(x.mean())).head()"
   ]
  },
  {
   "source": [
    "#### 【练一练】\n",
    "对一个序列以如下规则填充缺失值：如果单独出现的缺失值，就用前后均值填充，如果连续出现的缺失值就不填充，即序列`[1, NaN, 3, NaN, NaN]`填充后为`[1, 2, 3, NaN, NaN]`，请利用`fillna`函数实现。（提示：利用`limit`参数）\n",
    "#### 【END】\n",
    "### 2. 插值函数\n",
    "\n",
    "在关于`interpolate`函数的[文档](https://pandas.pydata.org/docs/reference/api/pandas.Series.interpolate.html#pandas.Series.interpolate)描述中，列举了许多插值法，包括了大量`Scipy`中的方法。由于很多插值方法涉及到比较复杂的数学知识，因此这里只讨论比较常用且简单的三类情况，即线性插值、最近邻插值和索引插值。\n",
    "\n",
    "对于`interpolate`而言，除了插值方法（默认为`linear`线性插值）之外，有与`fillna`类似的两个常用参数，一个是控制方向的`limit_direction`，另一个是控制最大连续缺失值插值个数的`limit`。其中，限制插值的方向默认为`forward`，这与`fillna`的`method`中的`ffill`是类似的，若想要后向限制插值或者双向限制插值可以指定为`backward`或`both`。"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "array([nan, nan,  1., nan, nan, nan,  2., nan, nan])"
      ]
     },
     "metadata": {},
     "execution_count": 18
    }
   ],
   "source": [
    "s = pd.Series([np.nan, np.nan, 1, np.nan, np.nan, np.nan, 2, np.nan, np.nan])\n",
    "s.values"
   ]
  },
  {
   "source": [
    "例如，在默认线性插值法下分别进行`backward`和双向限制插值，同时限制最大连续条数为1："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "array([ nan, 1.  , 1.  ,  nan,  nan, 1.75, 2.  ,  nan,  nan])"
      ]
     },
     "metadata": {},
     "execution_count": 19
    }
   ],
   "source": [
    "res = s.interpolate(limit_direction='backward', limit=1)\n",
    "res.values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "array([ nan, 1.  , 1.  , 1.25,  nan, 1.75, 2.  , 2.  ,  nan])"
      ]
     },
     "metadata": {},
     "execution_count": 20
    }
   ],
   "source": [
    "res = s.interpolate(limit_direction='both', limit=1)\n",
    "res.values"
   ]
  },
  {
   "source": [
    "第二种常见的插值是最近邻插补，即缺失值的元素和离它最近的非缺失值元素一样："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "array([nan, nan,  1.,  1.,  1.,  2.,  2., nan, nan])"
      ]
     },
     "metadata": {},
     "execution_count": 21
    }
   ],
   "source": [
    "s.interpolate('nearest').values"
   ]
  },
  {
   "source": [
    "最后来介绍索引插值，即根据索引大小进行线性插值。例如，构造不等间距的索引进行演示："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0      0.0\n",
       "1      NaN\n",
       "10    10.0\n",
       "dtype: float64"
      ]
     },
     "metadata": {},
     "execution_count": 22
    }
   ],
   "source": [
    "s = pd.Series([0,np.nan,10],index=[0,1,10])\n",
    "s"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0      0.0\n",
       "1      5.0\n",
       "10    10.0\n",
       "dtype: float64"
      ]
     },
     "metadata": {},
     "execution_count": 23
    }
   ],
   "source": [
    "s.interpolate() # 默认的线性插值，等价于计算中点的值"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0      0.0\n",
       "1      1.0\n",
       "10    10.0\n",
       "dtype: float64"
      ]
     },
     "metadata": {},
     "execution_count": 24
    }
   ],
   "source": [
    "s.interpolate(method='index') # 和索引有关的线性插值，计算相应索引大小对应的值"
   ]
  },
  {
   "source": [
    "同时，这种方法对于时间戳索引也是可以使用的，有关时间序列的其他话题会在第十章进行讨论，这里举一个简单的例子："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "2020-01-01     0.0\n",
       "2020-01-02     NaN\n",
       "2020-01-11    10.0\n",
       "dtype: float64"
      ]
     },
     "metadata": {},
     "execution_count": 25
    }
   ],
   "source": [
    "s = pd.Series([0,np.nan,10], index=pd.to_datetime(['20200101', '20200102', '20200111']))\n",
    "s"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "2020-01-01     0.0\n",
       "2020-01-02     5.0\n",
       "2020-01-11    10.0\n",
       "dtype: float64"
      ]
     },
     "metadata": {},
     "execution_count": 26
    }
   ],
   "source": [
    "s.interpolate()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "2020-01-01     0.0\n",
       "2020-01-02     1.0\n",
       "2020-01-11    10.0\n",
       "dtype: float64"
      ]
     },
     "metadata": {},
     "execution_count": 27
    }
   ],
   "source": [
    "s.interpolate(method='index')"
   ]
  },
  {
   "source": [
    "#### 【NOTE】关于polynomial和spline插值的注意事项\n",
    "在`interpolate`中如果选用`polynomial`的插值方法，它内部调用的是`scipy.interpolate.interp1d(*,*,kind=order)`，这个函数内部调用的是`make_interp_spline`方法，因此其实是样条插值而不是类似于`numpy`中的`polyfit`多项式拟合插值；而当选用`spline`方法时，`pandas`调用的是`scipy.interpolate.UnivariateSpline`而不是普通的样条插值。这一部分的文档描述比较混乱，而且这种参数的设计也是不合理的，当使用这两类插值方法时，用户一定要小心谨慎地根据自己的实际需求选取恰当的插值方法。\n",
    "#### 【END】\n",
    "## 三、Nullable类型\n",
    "### 1. 缺失记号及其缺陷\n",
    "\n",
    "在`python`中的缺失值用`None`表示，该元素除了等于自己本身之外，与其他任何元素不相等："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "metadata": {},
     "execution_count": 28
    }
   ],
   "source": [
    "None == None"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "metadata": {},
     "execution_count": 29
    }
   ],
   "source": [
    "None == False"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "metadata": {},
     "execution_count": 30
    }
   ],
   "source": [
    "None == []"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "metadata": {},
     "execution_count": 31
    }
   ],
   "source": [
    "None == ''"
   ]
  },
  {
   "source": [
    "在`numpy`中利用`np.nan`来表示缺失值，该元素除了不和其他任何元素相等之外，和自身的比较结果也返回`False`："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "metadata": {},
     "execution_count": 32
    }
   ],
   "source": [
    "np.nan == np.nan"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "metadata": {},
     "execution_count": 33
    }
   ],
   "source": [
    "np.nan == None"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "metadata": {},
     "execution_count": 34
    }
   ],
   "source": [
    "np.nan == False"
   ]
  },
  {
   "source": [
    "值得注意的是，虽然在对缺失序列或表格的元素进行比较操作的时候，`np.nan`的对应位置会返回`False`，但是在使用`equals`函数进行两张表或两个序列的相同性检验时，会自动跳过两侧表都是缺失值的位置，直接返回`True`："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0     True\n",
       "1    False\n",
       "dtype: bool"
      ]
     },
     "metadata": {},
     "execution_count": 35
    }
   ],
   "source": [
    "s1 = pd.Series([1, np.nan])\n",
    "s2 = pd.Series([1, 2])\n",
    "s3 = pd.Series([1, np.nan])\n",
    "s1 == 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "metadata": {},
     "execution_count": 36
    }
   ],
   "source": [
    "s1.equals(s2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "metadata": {},
     "execution_count": 37
    }
   ],
   "source": [
    "s1.equals(s3)"
   ]
  },
  {
   "source": [
    "在时间序列的对象中，`pandas`利用`pd.NaT`来指代缺失值，它的作用和`np.nan`是一致的（时间序列的对象和构造将在第十章讨论）："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "TimedeltaIndex(['0 days 00:00:30', NaT], dtype='timedelta64[ns]', freq=None)"
      ]
     },
     "metadata": {},
     "execution_count": 38
    }
   ],
   "source": [
    "pd.to_timedelta(['30s', np.nan]) # Timedelta中的NaT"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "DatetimeIndex(['2020-01-01', 'NaT'], dtype='datetime64[ns]', freq=None)"
      ]
     },
     "metadata": {},
     "execution_count": 39
    }
   ],
   "source": [
    "pd.to_datetime(['20200101', np.nan]) # Datetime中的NaT"
   ]
  },
  {
   "source": [
    "那么为什么要引入`pd.NaT`来表示时间对象中的缺失呢？仍然以`np.nan`的形式存放会有什么问题？在`pandas`中可以看到`object`类型的对象，而`object`是一种混杂对象类型，如果出现了多个类型的元素同时存储在`Series`中，它的类型就会变成`object`。例如，同时存放整数和字符串的列表："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0      1\n",
       "1    two\n",
       "dtype: object"
      ]
     },
     "metadata": {},
     "execution_count": 40
    }
   ],
   "source": [
    "pd.Series([1, 'two'])"
   ]
  },
  {
   "source": [
    "`NaT`问题的根源来自于`np.nan`的本身是一种浮点类型，而如果浮点和时间类型混合存储，如果不设计新的内置缺失类型来处理，就会变成含糊不清的`object`类型，这显然是不希望看到的。"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "float"
      ]
     },
     "metadata": {},
     "execution_count": 41
    }
   ],
   "source": [
    "type(np.nan)"
   ]
  },
  {
   "source": [
    "同时，由于`np.nan`的浮点性质，如果在一个整数的`Series`中出现缺失，那么其类型会转变为`float64`；而如果在一个布尔类型的序列中出现缺失，那么其类型就会转为`object`而不是`bool`："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "dtype('float64')"
      ]
     },
     "metadata": {},
     "execution_count": 42
    }
   ],
   "source": [
    "pd.Series([1, np.nan]).dtype"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "dtype('O')"
      ]
     },
     "metadata": {},
     "execution_count": 43
    }
   ],
   "source": [
    "pd.Series([True, False, np.nan]).dtype"
   ]
  },
  {
   "source": [
    "因此，在进入`1.0.0`版本后，`pandas`尝试设计了一种新的缺失类型`pd.NA`以及三种`Nullable`序列类型来应对这些缺陷，它们分别是`Int, boolean`和`string`。\n",
    "\n",
    "### 2. Nullable类型的性质\n",
    "\n",
    "从字面意义上看`Nullable`就是可空的，言下之意就是序列类型不受缺失值的影响。例如，在上述三个`Nullable`类型中存储缺失值，都会转为`pandas`内置的`pd.NA`："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    <NA>\n",
       "1       1\n",
       "dtype: Int64"
      ]
     },
     "metadata": {},
     "execution_count": 44
    }
   ],
   "source": [
    "pd.Series([np.nan, 1], dtype = 'Int64') # \"i\"是大写的"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    <NA>\n",
       "1    True\n",
       "dtype: boolean"
      ]
     },
     "metadata": {},
     "execution_count": 45
    }
   ],
   "source": [
    "pd.Series([np.nan, True], dtype = 'boolean')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0      <NA>\n",
       "1    my_str\n",
       "dtype: string"
      ]
     },
     "metadata": {},
     "execution_count": 46
    }
   ],
   "source": [
    "pd.Series([np.nan, 'my_str'], dtype = 'string')"
   ]
  },
  {
   "source": [
    "在`Int`的序列中，返回的结果会尽可能地成为`Nullable`的类型："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    <NA>\n",
       "1       1\n",
       "dtype: Int64"
      ]
     },
     "metadata": {},
     "execution_count": 47
    }
   ],
   "source": [
    "pd.Series([np.nan, 0], dtype = 'Int64') + 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    <NA>\n",
       "1    True\n",
       "dtype: boolean"
      ]
     },
     "metadata": {},
     "execution_count": 48
    }
   ],
   "source": [
    "pd.Series([np.nan, 0], dtype = 'Int64') == 0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    NaN\n",
       "1    0.0\n",
       "dtype: float64"
      ]
     },
     "metadata": {},
     "execution_count": 49
    }
   ],
   "source": [
    "pd.Series([np.nan, 0], dtype = 'Int64') * 0.5 # 只能是浮点"
   ]
  },
  {
   "source": [
    "对于`boolean`类型的序列而言，其和`bool`序列的行为主要有两点区别：\n",
    "\n",
    "第一点是带有缺失的布尔列表无法进行索引器中的选择，而`boolean`会把缺失值看作`False`："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    a\n",
       "dtype: object"
      ]
     },
     "metadata": {},
     "execution_count": 50
    }
   ],
   "source": [
    "s = pd.Series(['a', 'b'])\n",
    "s_bool = pd.Series([True, np.nan])\n",
    "s_boolean = pd.Series([True, np.nan]).astype('boolean')\n",
    "# s[s_bool] # 报错\n",
    "s[s_boolean]"
   ]
  },
  {
   "source": [
    "第二点是在进行逻辑运算时，`bool`类型在缺失处返回的永远是`False`，而`boolean`会根据逻辑运算是否能确定唯一结果来返回相应的值。那什么叫能否确定唯一结果呢？举个简单例子：`True | pd.NA`中无论缺失值为什么值，必然返回`True`；`False | pd.NA`中的结果会根据缺失值取值的不同而变化，此时返回`pd.NA`；`False & pd.NA`中无论缺失值为什么值，必然返回`False`。"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    True\n",
       "1    <NA>\n",
       "dtype: boolean"
      ]
     },
     "metadata": {},
     "execution_count": 51
    }
   ],
   "source": [
    "s_boolean & True"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    True\n",
       "1    True\n",
       "dtype: boolean"
      ]
     },
     "metadata": {},
     "execution_count": 52
    }
   ],
   "source": [
    "s_boolean | True"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    False\n",
       "1     <NA>\n",
       "dtype: boolean"
      ]
     },
     "metadata": {},
     "execution_count": 53
    }
   ],
   "source": [
    "~s_boolean # 取反操作同样是无法唯一地判断缺失结果"
   ]
  },
  {
   "source": [
    "关于`string`类型的具体性质将在下一章文本数据中进行讨论。\n",
    "\n",
    "一般在实际数据处理时，可以在数据集读入后，先通过`convert_dtypes`转为`Nullable`类型："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "School          string\n",
       "Grade           string\n",
       "Name            string\n",
       "Gender          string\n",
       "Height         float64\n",
       "Weight           Int64\n",
       "Transfer        string\n",
       "Test_Number      Int64\n",
       "Test_Date       string\n",
       "Time_Record     string\n",
       "dtype: object"
      ]
     },
     "metadata": {},
     "execution_count": 54
    }
   ],
   "source": [
    "df = pd.read_csv('../data/learn_pandas.csv')\n",
    "df = df.convert_dtypes()\n",
    "df.dtypes"
   ]
  },
  {
   "source": [
    "### 3. 缺失数据的计算和分组\n",
    "\n",
    "当调用函数`sum, prod`使用加法和乘法的时候，缺失数据等价于被分别视作0和1，即不改变原来的计算结果："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "14.0"
      ]
     },
     "metadata": {},
     "execution_count": 55
    }
   ],
   "source": [
    "s = pd.Series([2,3,np.nan,4,5])\n",
    "s.sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "120.0"
      ]
     },
     "metadata": {},
     "execution_count": 56
    }
   ],
   "source": [
    "s.prod()"
   ]
  },
  {
   "source": [
    "当使用累计函数时，会自动跳过缺失值所处的位置："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0     2.0\n",
       "1     5.0\n",
       "2     NaN\n",
       "3     9.0\n",
       "4    14.0\n",
       "dtype: float64"
      ]
     },
     "metadata": {},
     "execution_count": 57
    }
   ],
   "source": [
    "s.cumsum()"
   ]
  },
  {
   "source": [
    "当进行单个标量运算的时候，除了`np.nan ** 0`和`1 ** np.nan`这两种情况为确定的值之外，所有运算结果全为缺失（`pd.NA`的行为与此一致 ），并且`np.nan`在比较操作时一定返回`False`，而`pd.NA`返回`pd.NA`："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "metadata": {},
     "execution_count": 58
    }
   ],
   "source": [
    "np.nan == 0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "<NA>"
      ]
     },
     "metadata": {},
     "execution_count": 59
    }
   ],
   "source": [
    "pd.NA == 0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "metadata": {},
     "execution_count": 60
    }
   ],
   "source": [
    "np.nan > 0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "<NA>"
      ]
     },
     "metadata": {},
     "execution_count": 61
    }
   ],
   "source": [
    "pd.NA > 0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "nan"
      ]
     },
     "metadata": {},
     "execution_count": 62
    }
   ],
   "source": [
    "np.nan + 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "nan"
      ]
     },
     "metadata": {},
     "execution_count": 63
    }
   ],
   "source": [
    "np.log(np.nan)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "nan"
      ]
     },
     "metadata": {},
     "execution_count": 64
    }
   ],
   "source": [
    "np.add(np.nan, 1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "1.0"
      ]
     },
     "metadata": {},
     "execution_count": 65
    }
   ],
   "source": [
    "np.nan ** 0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "1"
      ]
     },
     "metadata": {},
     "execution_count": 66
    }
   ],
   "source": [
    "pd.NA ** 0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "1.0"
      ]
     },
     "metadata": {},
     "execution_count": 67
    }
   ],
   "source": [
    "1 ** np.nan"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "1"
      ]
     },
     "metadata": {},
     "execution_count": 68
    }
   ],
   "source": [
    "1 ** pd.NA"
   ]
  },
  {
   "source": [
    "另外需要注意的是，`diff, pct_change`这两个函数虽然功能相似，但是对于缺失的处理不同，前者凡是参与缺失计算的部分全部设为了缺失值，而后者缺失值位置会被设为 0% 的变化率："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    NaN\n",
       "1    1.0\n",
       "2    NaN\n",
       "3    NaN\n",
       "4    1.0\n",
       "dtype: float64"
      ]
     },
     "metadata": {},
     "execution_count": 69
    }
   ],
   "source": [
    "s.diff()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0         NaN\n",
       "1    0.500000\n",
       "2    0.000000\n",
       "3    0.333333\n",
       "4    0.250000\n",
       "dtype: float64"
      ]
     },
     "metadata": {},
     "execution_count": 70
    }
   ],
   "source": [
    "s.pct_change()"
   ]
  },
  {
   "source": [
    "对于一些函数而言，缺失可以作为一个类别处理，例如在`groupby, get_dummies`中可以设置相应的参数来进行增加缺失类别："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "  category  value\n",
       "0        a      1\n",
       "1        a      3\n",
       "2        b      5\n",
       "3      NaN      7\n",
       "4      NaN      9"
      ],
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>category</th>\n      <th>value</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>a</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>a</td>\n      <td>3</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>b</td>\n      <td>5</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>NaN</td>\n      <td>7</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>NaN</td>\n      <td>9</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 71
    }
   ],
   "source": [
    "df_nan = pd.DataFrame({'category':['a','a','b',np.nan,np.nan], 'value':[1,3,5,7,9]})\n",
    "df_nan"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "category\n",
       "a      2\n",
       "b      5\n",
       "NaN    8\n",
       "Name: value, dtype: int64"
      ]
     },
     "metadata": {},
     "execution_count": 72
    }
   ],
   "source": [
    "df_nan.groupby('category', dropna=False)['value'].mean() # pandas版本大于1.1.0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "   a  b  NaN\n",
       "0  1  0    0\n",
       "1  1  0    0\n",
       "2  0  1    0\n",
       "3  0  0    1\n",
       "4  0  0    1"
      ],
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>a</th>\n      <th>b</th>\n      <th>NaN</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>1</td>\n      <td>0</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>1</td>\n      <td>0</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>0</td>\n      <td>1</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>0</td>\n      <td>0</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>0</td>\n      <td>0</td>\n      <td>1</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 73
    }
   ],
   "source": [
    "pd.get_dummies(df_nan.category, dummy_na=True)"
   ]
  },
  {
   "source": [
    "## 四、练习\n",
    "### Ex1：缺失值与类别的相关性检验\n",
    "在数据处理中，含有过多缺失值的列往往会被删除，除非缺失情况与标签强相关。下面有一份关于二分类问题的数据集，其中`X_1, X_2`为特征变量，`y`为二分类标签。"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "    X_1  X_2  y\n",
       "0   NaN  NaN  0\n",
       "1   NaN  NaN  0\n",
       "2   NaN  NaN  0\n",
       "3  43.0  NaN  0\n",
       "4   NaN  NaN  0"
      ],
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>X_1</th>\n      <th>X_2</th>\n      <th>y</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>43.0</td>\n      <td>NaN</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>0</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 74
    }
   ],
   "source": [
    "df = pd.read_csv('../data/missing_chi.csv')\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "X_1    0.855\n",
       "X_2    0.894\n",
       "y      0.000\n",
       "dtype: float64"
      ]
     },
     "metadata": {},
     "execution_count": 75
    }
   ],
   "source": [
    "df.isna().mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    0.918\n",
       "1    0.082\n",
       "Name: y, dtype: float64"
      ]
     },
     "metadata": {},
     "execution_count": 76
    }
   ],
   "source": [
    "df.y.value_counts(normalize=True)"
   ]
  },
  {
   "source": [
    "事实上，有时缺失值出现或者不出现本身就是一种特征，并且在一些场合下可能与标签的正负是相关的。关于缺失出现与否和标签的正负性，在统计学中可以利用卡方检验来断言它们是否存在相关性。按照特征缺失的正例、特征缺失的负例、特征不缺失的正例、特征不缺失的负例，可以分为四种情况，设它们分别对应的样例数为$n_{11}, n_{10}, n_{01}, n_{00}$。假若它们是不相关的，那么特征缺失中正例的理论值，就应该接近于特征缺失总数$\\times$总体正例的比例，即：\n",
    "\n",
    "$$E_{11} = n_{11} \\approx (n_{11}+n_{10})\\times\\frac{n_{11}+n_{01}}{n_{11}+n_{10}+n_{01}+n_{00}} = F_{11}$$\n",
    "\n",
    "其他的三种情况同理。现将实际值和理论值分别记作$E_{ij}, F_{ij}$，那么希望下面的统计量越小越好，即代表实际值接近不相关情况的理论值：\n",
    "\n",
    "$$S = \\sum_{i\\in \\{0,1\\}}\\sum_{j\\in \\{0,1\\}} \\frac{(E_{ij}-F_{ij})^2}{F_{ij}}$$\n",
    "\n",
    "可以证明上面的统计量近似服从自由度为$1$的卡方分布，即$S\\overset{\\cdot}{\\sim} \\chi^2(1)$。因此，可通过计算$P(\\chi^2(1)>S)$的概率来进行相关性的判别，一般认为当此概率小于$0.05$时缺失情况与标签正负存在相关关系，即不相关条件下的理论值与实际值相差较大。\n",
    "\n",
    "上面所说的概率即为统计学上关于$2\\times2$列联表检验问题的$p$值， 它可以通过`scipy.stats.chi2.sf(S, 1)`得到。请根据上面的材料，分别对`X_1, X_2`列进行检验。\n",
    "\n",
    "### Ex2：用回归模型解决分类问题\n",
    "\n",
    "`KNN`是一种监督式学习模型，既可以解决回归问题，又可以解决分类问题。对于分类变量，利用`KNN`分类模型可以实现其缺失值的插补，思路是度量缺失样本的特征与所有其他样本特征的距离，当给定了模型参数`n_neighbors=n`时，计算离该样本距离最近的$n$个样本点中最多的那个类别，并把这个类别作为该样本的缺失预测类别，具体如下图所示，未知的类别被预测为黄色：\n",
    "\n",
    "<img src=\"../source/_static/ch7_ex.png\" width=\"25%\">\n",
    "\n",
    "上面有色点的特征数据提供如下："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "    X1   X2 Color\n",
       "0 -2.5  2.8  Blue\n",
       "1 -1.5  1.8  Blue\n",
       "2 -0.8  2.8  Blue"
      ],
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>X1</th>\n      <th>X2</th>\n      <th>Color</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>-2.5</td>\n      <td>2.8</td>\n      <td>Blue</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>-1.5</td>\n      <td>1.8</td>\n      <td>Blue</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>-0.8</td>\n      <td>2.8</td>\n      <td>Blue</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 77
    }
   ],
   "source": [
    "df = pd.read_excel('../data/color.xlsx')\n",
    "df.head(3)"
   ]
  },
  {
   "source": [
    "已知待预测的样本点为$X_1=0.8, X_2=-0.2$，那么预测类别可以如下写出："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "array(['Yellow'], dtype=object)"
      ]
     },
     "metadata": {},
     "execution_count": 78
    }
   ],
   "source": [
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "clf = KNeighborsClassifier(n_neighbors=6)\n",
    "clf.fit(df.iloc[:,:2], df.Color)\n",
    "clf.predict([[0.8, -0.2]])"
   ]
  },
  {
   "source": [
    "1. 对于回归问题而言，需要得到的是一个具体的数值，因此预测值由最近的$n$个样本对应的平均值获得。请把上面的这个分类问题转化为回归问题，仅使用`KNeighborsRegressor`来完成上述的`KNeighborsClassifier`功能。\n",
    "2. 请根据第1问中的方法，对`audit`数据集中的`Employment`变量进行缺失值插补。"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "        ID  Age Employment    Marital     Income  Gender  Hours\n",
       "0  1004641   38    Private  Unmarried   81838.00  Female     72\n",
       "1  1010229   35    Private     Absent   72099.00    Male     30\n",
       "2  1024587   32    Private   Divorced  154676.74    Male     40"
      ],
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>ID</th>\n      <th>Age</th>\n      <th>Employment</th>\n      <th>Marital</th>\n      <th>Income</th>\n      <th>Gender</th>\n      <th>Hours</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>1004641</td>\n      <td>38</td>\n      <td>Private</td>\n      <td>Unmarried</td>\n      <td>81838.00</td>\n      <td>Female</td>\n      <td>72</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>1010229</td>\n      <td>35</td>\n      <td>Private</td>\n      <td>Absent</td>\n      <td>72099.00</td>\n      <td>Male</td>\n      <td>30</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>1024587</td>\n      <td>32</td>\n      <td>Private</td>\n      <td>Divorced</td>\n      <td>154676.74</td>\n      <td>Male</td>\n      <td>40</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 79
    }
   ],
   "source": [
    "df = pd.read_csv('../data/audit.csv')\n",
    "df.head(3)"
   ]
  }
 ]
}